Refining Duplicate Detection for Improved Data Quality
Authors
Abstract
Detecting duplicates is a pervasive data quality challenge that prevents organizations from extracting value from their data in a timely manner. The increased complexity and heterogeneity of modern datasets have led to varying record formats, missing values, and evolving data semantics. As data is integrated, duplicates inevitably arise in the integrated instance. One of the challenges in deduplication is determining whether two values are sufficiently close to be considered equal. Existing similarity functions often rely on counting the number of edits required to transform one value into the other. This is insufficient in attribute domains, such as time, where small syntactic differences do not always translate to 'closeness'. In this paper, we propose a duplicate detection framework that adapts metric functional dependencies (MFDs) to improve detection accuracy by relaxing the matching condition on numeric values to allow a permitted tolerance. We evaluate our techniques against two existing approaches on three real data collections, and show that we achieve an average 25% and 34% improvement in precision and recall, respectively, over non-MFD versions.
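To make the time example concrete: an MFD relaxes an ordinary functional dependency X → Y by requiring only that, for tuples agreeing on X, the Y-values lie within a distance threshold under some metric, rather than being strictly equal. The sketch below is ours, not the paper's; the function names, the HH:MM format, and the five-minute tolerance are illustrative assumptions. It contrasts plain edit distance with such a tolerance-based comparison.

```python
from datetime import datetime

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mfd_time_match(a: str, b: str, tolerance_minutes: float = 5.0) -> bool:
    """MFD-style comparison (hypothetical helper): two HH:MM timestamps
    'match' when their metric distance is within the permitted tolerance."""
    fmt = "%H:%M"
    ta, tb = datetime.strptime(a, fmt), datetime.strptime(b, fmt)
    gap = abs((ta - tb).total_seconds()) / 60.0
    gap = min(gap, 24 * 60 - gap)   # times wrap around midnight
    return gap <= tolerance_minutes

# "23:59" and "00:01" differ in four of five characters, yet are
# only two minutes apart: edit distance says "far", the metric says "close".
print(edit_distance("23:59", "00:01"))   # 4
print(mfd_time_match("23:59", "00:01"))  # True
```

Edit distance penalizes the midnight wrap-around heavily, while the metric comparison captures the intended notion of closeness for time-valued attributes.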
Similar Resources
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
Accuracy and validity of data are prerequisites for the proper operation of any software system. There is always the possibility of errors in data due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity; only one of them should remain in a data source, but for some reasons, such as the aggregation of ...
A Novel Framework and Model for Data
Data cleansing is a process that deals with the identification of corrupt and duplicate data inherent in the data sets of a data warehouse, in order to enhance the quality of the data. This paper aims to facilitate the data cleaning process by addressing the problem of duplicate record detection pertaining to the 'name' attributes of the data sets. It provides a sequence of algorithms through a novel framework ...
Using Trainable Duplicate Detection for Automated Public Data Refining
Public institutions share important data on the Web. These data are essential for public oversight and thus increase transparency. However, they are difficult to process, since they contain numerous mistypings, ambiguities, and duplicates. In this paper we propose an automated approach for cleaning these data, so that subsequent query results are reliable. We develop a duplicate detection...
Semantic Similarity Match for Data Quality
Data quality is a critical aspect of applications that support business operations. Entities are often represented more than once in data repositories. Since duplicate records do not share a common key, they are hard to detect. Duplicate detection over text is usually performed using lexical approaches, which do not capture the sense of the text. The difficulties increase when duplicate detection must...
An Efficient Approach for Near-duplicate page detection in web crawling
The rapid development of the World Wide Web in recent times has given the concept of web crawling remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to web search engines, making their results less relevant to users. The presence of duplicate and near-duplicate web documents in abundance has created additional overhead...